ABSTRACT

This paper focuses on the problem of simultaneous sample and feature selection for machine learning in a fully unsupervised setting. Most existing works tackle these two problems separately, which has given rise to two well-studied sub-areas, namely active learning and feature selection. A unified approach is nevertheless attractive because the two problems are often interleaved: noisy and high-dimensional features adversely affect sample selection, while 'good' samples benefit feature selection. We present a unified framework to conduct active learning and feature selection simultaneously. From the data reconstruction perspective, the selected samples and the selected features should each best approximate the original dataset, so that the selected samples characterized by the selected features are highly representative. Additionally, our method is one-shot and does not iteratively select samples for progressive labeling. Thus our model is especially suitable when the initial labeled samples are scarce or totally absent, a case that existing works hardly address, particularly for simultaneous feature selection. To alleviate the NP-hardness of the raw problem, the proposed formulation involves a convex but non-smooth optimization problem. We solve it efficiently by an iterative algorithm and prove its global convergence. Experiments on publicly available datasets validate that our method is promising compared with the state of the art.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - data mining; I.2.6 [Artificial Intelligence]: Learning

Keywords

Active learning, feature selection, matrix factorization

1. INTRODUCTION

In many real-life machine learning tasks, unlabeled data is often easily available whereas labeled data is scarce. To build powerful predictive models, one usually requires domain experts to manually annotate samples, which is an expensive and time-consuming procedure. Active learning [9] provides a means to alleviate this problem by carefully selecting the samples to be labeled by experts. Typically, active learning algorithms prefer to query those unlabeled samples that would improve prediction performance the most if they were labeled and used as training data. In this way, the active learner aims to pick as few samples as possible to label, minimizing the total annotation cost while still allowing an accurate supervised learning model to be built from the labeled data.

Many active learning algorithms have been proposed [7], [28], [16], [19] in the past decade. There are two main groups of methods for selecting unlabeled samples to label [29]. One group selects the most informative samples, as in uncertainty sampling [18], query by committee [9], and empirical risk minimization [28]. These algorithms are implemented iteratively: a model is learned from the existing labeled data, and new samples are chosen to be labeled based on the learned model. Since training a model usually needs a large amount of labeled data to avoid sample bias, these methods should be used after sufficient labeled samples have been collected [24]. The other group aims at querying the most representative samples from a data reconstruction perspective [31], [6], [4], [24], [15]. Unlike the first group, methods in this group are one-shot and non-iterative in selecting samples. Such active learning methods are usually applied when there is no initial labeled data.

Although active learning has been well studied for years, it still faces issues in many real-world scenarios. For example, samples are often characterized by high-dimensional features, some of which are noisy or irrelevant. These noisy or irrelevant features adversely affect the selection of informative or representative samples. Moreover, after samples are queried, supervised learning models such as decision trees are often trained on the labeled data for various applications. However, high-dimensional features significantly increase the time and space requirements of model training. Meanwhile, when only limited labeled samples are available, it is difficult to guarantee reliable model parameter estimates in a high-dimensional feature space. One may argue that applying a state-of-the-art feature selection technique, such as Q-α [30], to learn a low-dimensional representation before active learning would solve these problems. This should indeed help active learning to some extent. However, because common feature selection techniques and active learning algorithms are designed independently, directly combining them usually cannot guarantee optimal results. It is therefore beneficial to devise a principled model and algorithm that incorporate active learning and feature selection in a unified fashion. Recently, Raghavan et al. [26] presented a method that uses human feedback on both features and samples for active learning. Kong et al. [17] proposed a dual feature selection and sample selection method in the context of graph classification. Bilgic [1] proposed a dynamic dimensionality reduction algorithm that determines the appropriate number of dimensions for each active learning iteration. Since all three algorithms are implemented iteratively and need to train models for querying in each iteration, they are suited to the scenarios of the first group of active learning methods. In contrast, we focus on learning important features for the second group of active learning algorithms, i.e., the case in which no initial labeled samples are available. This is an unsupervised learning problem, which is much harder due to the absence of labels that would guide the search for relevant information.

In this paper, we present a unified view of (sample-based) Active Learning and Feature Selection, called ALFS, which is inspired by the approximation method for CUR matrix decomposition.

The main contributions of this paper are:

i) To our knowledge, this is the first work to present a unified view of one-shot active learning and feature selection, which is important for real-world applications, since it dispenses with any labeling effort, unlike progressive, interactive-labeling active learning methods.

ii) This work is the first to formulate and build the natural connection between CUR decomposition and simultaneous sample and feature selection.

iii) We devise a novel model and a convex optimization algorithm to solve the one-shot sample and feature learning problem.

iv) The convergence of the proposed iterative algorithm is theoretically proved, and extensive empirical results demonstrate the advantages of our approach.

The rest of this paper is organized as follows. In Section 2, we propose a unified framework to conduct active learning and feature selection simultaneously. Section 3 reviews related work on the second group of active learning algorithms. The experimental results are reported in Section 4. Section 5 presents concluding remarks and future work.

Notations. In this paper, matrices are written as boldface uppercase letters and vectors are written as boldface lowercase letters. Given a matrix $\mathbf{P}$, we denote its $(i,j)$-th entry, $i$-th row, and $j$-th column as $P_{ij}$, $\mathbf{P}^i$, and $\mathbf{P}_j$, respectively. The only vector norm used is the $\ell_2$ norm, denoted by $\|\cdot\|_2$.

A variety of norms on matrices will be used. The $\ell_1$, $\ell_{2,1}$, and $\ell_\infty$ norms of a matrix are defined by

$$\|\mathbf{P}\|_1 = \sum_{i,j} |P_{ij}|, \qquad \|\mathbf{P}\|_{2,1} = \sum_{i} \sqrt{\sum_{j} P_{ij}^2} = \sum_{i} \|\mathbf{P}^i\|_2, \qquad \|\mathbf{P}\|_\infty = \max_{i,j} |P_{ij}|,$$

respectively. The $\ell_{2,0}$ quasi-norm of a matrix $\mathbf{P}$ is defined as the number of nonzero rows of $\mathbf{P}$, denoted by $\|\mathbf{P}\|_{2,0}$. The Frobenius norm and the nuclear norm (the sum of the singular values of a matrix) are denoted by $\|\mathbf{P}\|_F$ and $\|\mathbf{P}\|_*$, respectively. The Euclidean inner product between two matrices is $\langle \mathbf{P}, \mathbf{Q} \rangle = \mathrm{tr}(\mathbf{P}^T \mathbf{Q})$, where $\mathbf{P}^T$ is the transpose of the matrix $\mathbf{P}$ and $\mathrm{tr}(\cdot)$ is the trace of a matrix. The rank of a matrix is denoted by $\mathrm{rank}(\cdot)$.
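
As a quick check of the definitions above, the norms can be evaluated numerically. The following is a minimal NumPy sketch; the function name `matrix_norms` and the example matrix are illustrative assumptions, not part of the paper.

```python
import numpy as np

def matrix_norms(P):
    """Evaluate the matrix norms from the Notations paragraph (illustrative helper)."""
    P = np.asarray(P, dtype=float)
    row_l2 = np.sqrt((P ** 2).sum(axis=1))                 # l2 norm of every row P^i
    return {
        "l1": np.abs(P).sum(),                             # sum of |P_ij|
        "l21": row_l2.sum(),                               # sum of the row l2 norms
        "linf": np.abs(P).max(),                           # max of |P_ij|
        "l20": int(np.count_nonzero(row_l2)),              # number of nonzero rows
        "fro": np.sqrt((P ** 2).sum()),                    # Frobenius norm
        "nuc": np.linalg.svd(P, compute_uv=False).sum(),   # sum of singular values
    }

# Small example: rows (3, 4), (0, 0), (0, -2).
P = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, -2.0]])
norms = matrix_norms(P)
```

For this example the row norms are 5, 0, and 2, so the $\ell_{2,1}$ norm is 7 while the $\ell_{2,0}$ quasi-norm is 2.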

2. PROPOSED METHOD 

Given an unlabeled dataset $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, our goal is to pick $m$ ($m < n$) samples for labeling by the user, and simultaneously to select $r$ ($r < d$) features as the new feature representation, such that the potential performance is maximized when the model is trained on the selected $m$ labeled samples under the new representation. This is a more challenging problem than traditional representativeness-based active learning, because selecting $m$ samples to best approximate $\mathbf{X}$ often leads to an NP-hard problem [31], and finding $r$ features as the most representative feature subset is also often NP-hard [14].

2.1 Active Learning and Feature Selection via 
Matrix Decomposition 

Inspired by the CUR matrix decomposition [2], we propose a unified framework to find the most representative samples and features. To make this paper self-contained, we first introduce CUR matrix factorization.

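Definition 2.1. Given $\mathbf{X} \in \mathbb{R}^{d \times n}$ of rank $\rho = \mathrm{rank}(\mathbf{X})$, rank parameter $k < \rho$, and accuracy parameter $0 < \varepsilon < 1$, the CUR factorization for $\mathbf{X}$ aims to find $\mathbf{C} \in \mathbb{R}^{d \times m}$ with $m$ columns from $\mathbf{X}$, $\mathbf{R} \in \mathbb{R}^{r \times n}$ with $r$ rows of $\mathbf{X}$, and $\mathbf{U} \in \mathbb{R}^{m \times r}$, with $m$, $r$, and $\mathrm{rank}(\mathbf{U})$ being as small as possible, such that $\mathbf{X}$ is reconstructed within relative error:

$$\|\mathbf{X} - \mathbf{C}\mathbf{U}\mathbf{R}\|_F^2 \le (1 + \varepsilon)\,\|\mathbf{X} - \mathbf{X}_k\|_F^2, \tag{1}$$

where $\mathbf{X}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T \in \mathbb{R}^{d \times n}$ is the best rank-$k$ matrix obtained via the SVD of $\mathbf{X}$.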
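
To make the definition concrete, the sketch below builds a CUR approximation from a hand-picked set of column and row indices and compares its squared error with the best rank-$k$ error on the right-hand side of (1). The index choice is hypothetical, and $\mathbf{U} = \mathbf{C}^{+}\mathbf{X}\mathbf{R}^{+}$ (the least-squares choice for fixed $\mathbf{C}$ and $\mathbf{R}$) is a common textbook option assumed here for illustration; it is not the solver proposed in this paper.

```python
import numpy as np

def cur_from_indices(X, col_idx, row_idx):
    """CUR factors for given column/row indices (illustrative, not the paper's solver)."""
    C = X[:, col_idx]                                  # d x m: selected samples (columns)
    R = X[row_idx, :]                                  # r x n: selected features (rows)
    U = np.linalg.pinv(C) @ X @ np.linalg.pinv(R)      # m x r: least-squares middle factor
    return C, U, R

def best_rank_k_error(X, k):
    """Squared Frobenius error of the best rank-k approximation, ||X - X_k||_F^2."""
    s = np.linalg.svd(X, compute_uv=False)
    return float((s[k:] ** 2).sum())

# Synthetic rank-3 data with d = 6 features and n = 8 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 8))

C, U, R = cur_from_indices(X, col_idx=[0, 1, 2], row_idx=[0, 1, 2])
cur_error = float(np.linalg.norm(X - C @ U @ R, "fro") ** 2)
rank3_error = best_rank_k_error(X, 3)
```

Because the synthetic matrix has exact rank 3 and the three chosen columns and rows span its column and row spaces, both errors are zero up to rounding. On real data the choice of indices is exactly what the rest of this section is about.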

From an algorithmic perspective, the matrices $\mathbf{C}$, $\mathbf{U}$, and $\mathbf{R}$ can be obtained by minimizing the approximation error $\|\mathbf{X} - \mathbf{C}\mathbf{U}\mathbf{R}\|_F^2$. Here we make a key observation: the above definition is closely related to the problem of simultaneous sample and feature selection, although, to our surprise, existing works hardly point out or explore this connection to solve the active learning problem. On the one hand, $\mathbf{U}\mathbf{R}$ can be regarded as a reconstruction coefficient matrix and $\mathbf{C}$ contains the $m$ selected samples, so minimizing $\|\mathbf{X} - \mathbf{C}\mathbf{U}\mathbf{R}\|_F^2$ means that the total reconstruction error is minimized, which makes the data points in $\mathbf{C}$ the most representative. The reconstruction coefficients $\mathbf{U}\mathbf{R}$ are related to an $r$-dimensional feature subset of the dataset: the reconstruction coefficients of each reconstructed data point $\mathbf{x}_i$ are formed by a linear combination of its $r$ features. On the other hand, $\mathbf{C}\mathbf{U}$ can also be regarded as a reconstruction coefficient matrix and $\mathbf{R}$ is the new low-dimensional representation of $\mathbf{X}$, so minimizing $\|\mathbf{X} - \mathbf{C}\mathbf{U}\mathbf{R}\|_F^2$ also indicates that the selected $r$ features represent the whole dataset most precisely. The construction of the coefficient matrix $\mathbf{C}\mathbf{U}$ depends on a sample subset of $\mathbf{X}$. Clearly, active learning and feature selection can be conducted simultaneously in such a joint framework via CUR factorization.

Despite the above connection between CUR decomposition and feature selection and active learning, the original CUR formulation and its existing solvers cannot be directly applied to the simultaneous feature and sample selection task, because a general CUR model is under-determined. In the context of active sample/feature learning, this paper proposes a tailored objective function rooted in CUR decomposition that is more informative thanks to added regularization terms incorporating prior knowledge. Moreover, unlike most existing CUR solvers, which are randomized or heuristic algorithms [22], [5], we use structured sparsity-inducing norms to relax the objective from a non-convex optimization problem to a convex one, which allows us to devise an efficient variant of the alternating direction method of multipliers (ADMM) [10].

2.2 A Convex Formulation 

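Let $\mathbf{p} = (p_1, \ldots, p_n)^T \in \{0,1\}^n$ and $\mathbf{q} = (q_1, \ldots, q_d)^T \in \{0,1\}^d$ denote two indicator vectors that represent whether a sample and a feature is selected or not, respectively: $p_i = 1$ (or 0) indicates that the $i$-th sample is selected (or not), and $q_i = 1$ (or 0) means that the $i$-th feature is selected (or not). Minimizing $\|\mathbf{X} - \mathbf{C}\mathbf{U}\mathbf{R}\|_F^2$ can then be rewritten as:

$$\min_{\mathbf{p},\,\mathbf{q},\,\mathbf{U} \in \mathbb{R}^{n \times d}} \|\mathbf{X} - \mathbf{X}\,\mathrm{diag}(\mathbf{p})\,\mathbf{U}\,\mathrm{diag}(\mathbf{q})\,\mathbf{X}\|_F^2 \quad \text{s.t.} \quad \mathbf{1}_n^T \mathbf{p} = m,\ \mathbf{p} \in \{0,1\}^n,\ \mathbf{1}_d^T \mathbf{q} = r,\ \mathbf{q} \in \{0,1\}^d, \tag{2}$$

where $\mathrm{diag}(\mathbf{p})$ is a diagonal matrix whose diagonal elements are the entries of $\mathbf{p}$, and $\mathbf{1}_n$ is an $n$-dimensional vector with all components equal to 1. In (2), $\mathbf{X}\,\mathrm{diag}(\mathbf{p})$ keeps $m$ columns of $\mathbf{X}$ unchanged and resets the remaining $n - m$ columns to zero vectors, while $\mathrm{diag}(\mathbf{q})\,\mathbf{X}$ keeps $r$ rows of $\mathbf{X}$ unchanged and resets the remaining $d - r$ rows to zero vectors.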

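Using the matrix $\ell_{2,0}$ norm, we formulate the problem in (2) as:

$$\min_{\mathbf{W} \in \mathbb{R}^{n \times d}} \|\mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{X}\|_F^2 \quad \text{s.t.} \quad \|\mathbf{W}\|_{2,0} = m,\ \|\mathbf{W}^T\|_{2,0} = r, \tag{3}$$

where $\mathbf{W} = \mathrm{diag}(\mathbf{p})\,\mathbf{U}\,\mathrm{diag}(\mathbf{q})$.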
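
The effect of the two indicator vectors is easy to verify numerically. The sketch below, with arbitrary example values for $\mathbf{p}$, $\mathbf{q}$, and a random $\mathbf{U}$ (all assumptions for illustration), checks that $\mathbf{W} = \mathrm{diag}(\mathbf{p})\,\mathbf{U}\,\mathrm{diag}(\mathbf{q})$ has exactly $m$ nonzero rows and $r$ nonzero columns, which is what the two $\ell_{2,0}$ constraints express.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 6
X = rng.standard_normal((d, n))          # d features, n samples
U = rng.standard_normal((n, d))          # dense coefficient matrix

p = np.array([1, 0, 1, 0, 0, 1])         # m = 3 selected samples
q = np.array([0, 1, 1, 0])               # r = 2 selected features

W = np.diag(p) @ U @ np.diag(q)          # n x d

# ||W||_{2,0}: number of nonzero rows; ||W^T||_{2,0}: number of nonzero columns.
nonzero_rows = int(np.count_nonzero(np.linalg.norm(W, axis=1)))
nonzero_cols = int(np.count_nonzero(np.linalg.norm(W, axis=0)))

# X diag(p) zeroes the unselected columns; diag(q) X zeroes the unselected rows.
masked_cols = X @ np.diag(p)
masked_rows = np.diag(q) @ X
same_reconstruction = np.allclose(X @ W @ X, masked_cols @ U @ masked_rows)
```

Here `nonzero_rows` equals $m = 3$ and `nonzero_cols` equals $r = 2$, and the reconstruction is the same whether the masks are absorbed into $\mathbf{W}$ or applied to $\mathbf{X}$.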

In Definition 2.1, the rank of the matrix $\mathbf{U}$ should be as small as possible. Based on this point and (3), we propose to optimize the following objective function:

$$\min_{\mathbf{W} \in \mathbb{R}^{n \times d}} \|\mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{X}\|_F^2 + \alpha\|\mathbf{W}\|_{2,0} + \beta\|\mathbf{W}^T\|_{2,0} + \gamma\,\mathrm{rank}(\mathbf{W}), \tag{4}$$

where $\alpha > 0$, $\beta > 0$, and $\gamma > 0$ are three regularization parameters. The first three terms in (4) aim to select a sample subset and a feature subset that minimize the reconstruction error. The last term encourages $\mathbf{W}$ to be a low-rank matrix.

However, (4) is still an NP-hard problem due to the matrix $\ell_{2,0}$ norm and the combinatorial nature of the rank function. Fortunately, there exists theoretical progress showing that $\|\mathbf{W}\|_{2,1}$ is the minimum convex hull of $\|\mathbf{W}\|_{2,0}$ [24]. Minimizing $\|\mathbf{W}\|_{2,1}$ yields the same result as minimizing $\|\mathbf{W}\|_{2,0}$, as long as $\mathbf{W}$ is row-sparse enough. Meanwhile, it has been proved that the convex envelope of $\mathrm{rank}(\mathbf{W})$ on the set $\{\mathbf{W} \in \mathbb{R}^{n \times d} : \sigma_1(\mathbf{W}) \le 1\}$ is the nuclear norm $\|\mathbf{W}\|_*$ [5], where $\sigma_1(\mathbf{W})$ is the largest singular value of $\mathbf{W}$. In other words, the nuclear norm is the best convex approximation of the rank function over the unit ball of matrices whose largest singular value is at most one. Therefore, (4) can be relaxed to the following convex optimization problem:

$$\min_{\mathbf{W} \in \mathbb{R}^{n \times d}} \|\mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{X}\|_F^2 + \alpha\|\mathbf{W}\|_{2,1} + \beta\|\mathbf{W}^T\|_{2,1} + \gamma\|\mathbf{W}\|_*. \tag{5}$$
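
The relaxed objective (5) is straightforward to evaluate, and its convexity can be spot-checked numerically. The sketch below is only an illustration: the function name `relaxed_objective`, the random test matrices, and the unit regularization weights are assumptions.

```python
import numpy as np

def relaxed_objective(X, W, alpha, beta, gamma):
    """Value of the convex objective (5) at W (illustrative helper)."""
    recon = np.linalg.norm(X - X @ W @ X, "fro") ** 2      # reconstruction error
    l21_rows = np.linalg.norm(W, axis=1).sum()             # ||W||_{2,1}
    l21_cols = np.linalg.norm(W, axis=0).sum()             # ||W^T||_{2,1}
    nuclear = np.linalg.svd(W, compute_uv=False).sum()     # ||W||_*
    return float(recon + alpha * l21_rows + beta * l21_cols + gamma * nuclear)

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 5))          # d = 4, n = 5
A = rng.standard_normal((5, 4))          # two arbitrary candidates for W
B = rng.standard_normal((5, 4))

# Midpoint convexity: f((A + B) / 2) <= (f(A) + f(B)) / 2.
f_mid = relaxed_objective(X, 0.5 * (A + B), 1.0, 1.0, 1.0)
f_avg = 0.5 * (relaxed_objective(X, A, 1.0, 1.0, 1.0)
               + relaxed_objective(X, B, 1.0, 1.0, 1.0))
f_zero = relaxed_objective(X, np.zeros((5, 4)), 1.0, 1.0, 1.0)
```

At $\mathbf{W} = \mathbf{0}$ all regularizers vanish and the objective reduces to $\|\mathbf{X}\|_F^2$, which gives a simple sanity check for any implementation.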

2.3 Local Linear Reconstruction 

In the new objective function (5), each data point is reconstructed by a linear combination of all the selected points (when the $i$-th row of the reconstruction coefficient matrix $\mathbf{W}\mathbf{X}$ in (5) is not a zero vector, $\mathbf{x}_i$ is chosen as one of the most representative samples; otherwise, $\mathbf{x}_i$ is not selected). However, it is more reasonable to suppose that a data point can be mainly recovered from its neighbors [15], [4]. Intuitively, if the distance between the reconstructed point and a selected point is large, the contribution of the selected point to the reconstruction of the target point should be small, and thus the corresponding reconstruction coefficient should be penalized. In light of this point, we incorporate a regularization term into (5):

$$\min_{\mathbf{W} \in \mathbb{R}^{n \times d}} \|\mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{X}\|_F^2 + \alpha\|\mathbf{W}\|_{2,1} + \beta\|\mathbf{W}^T\|_{2,1} + \gamma\|\mathbf{W}\|_* + \eta\|\mathbf{T} \odot (\mathbf{W}\mathbf{X})\|_1, \tag{6}$$

where $\eta > 0$ is a regularization parameter, and $\odot$ denotes the element-wise multiplication of two matrices. $\mathbf{T}$ is a weight matrix, where $T_{ij}$ encodes the distance between the $i$-th and $j$-th samples. From the data reconstruction perspective, if two unit vectors have the same or opposite directions, their distance should be minimal, since either vector can be fully recovered from the other; conversely, if the two vectors are orthogonal, their distance should be maximal, because they contribute little to each other's reconstruction. Therefore, we use the absolute value of the cosine of the angle between two feature vectors to measure their similarity, and define the inverse of this absolute value as their distance:

$$T_{ij} = \frac{1}{|\cos\theta_{ij}|}, \tag{7}$$

where $\theta_{ij}$ denotes the angle between $\mathbf{x}_i$ and $\mathbf{x}_j$.¹
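
The weight matrix in (7) can be computed directly from the data matrix. The sketch below is a minimal illustration: it adds a small constant `eps` to every denominator (a simplification of the regularization mentioned in footnote 1, which is only needed when the cosine is zero), and it assumes that no sample is the zero vector. The function name and the toy data are hypothetical.

```python
import numpy as np

def distance_weights(X, eps=1e-8):
    """T_ij = 1 / (|cos(theta_ij)| + eps) for the columns (samples) of X."""
    norms = np.linalg.norm(X, axis=0)                  # length of every sample
    cos = (X.T @ X) / np.outer(norms, norms)           # pairwise cosines, n x n
    return 1.0 / (np.abs(cos) + eps)

# Three 2-D samples: x_1 and x_2 are anti-parallel, x_3 is orthogonal to both.
X = np.array([[1.0, -2.0, 0.0],
              [0.0,  0.0, 3.0]])
T = distance_weights(X)
```

For the parallel or anti-parallel pair the weight is approximately 1 (the minimum), while for the orthogonal pairs it is of order `1 / eps`, so the corresponding coefficients in $\mathbf{W}\mathbf{X}$ are penalized heavily.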

After obtaining the optimal $\mathbf{W}$ in (6), we sort all the samples by the $\ell_2$ norm of the rows of $\mathbf{W}$ in descending order, and select the top $m$ samples as the representative ones. Similarly, we rank all the features by the $\ell_2$ norm of the columns of $\mathbf{W}$ in descending order, and choose the top $r$ features to represent the samples.
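
The ranking step described above amounts to two sorts. A minimal sketch follows, with a hand-written $\mathbf{W}$ standing in for the learned matrix; the function name and the example values are assumptions.

```python
import numpy as np

def select_samples_and_features(W, m, r):
    """Rank rows (samples) and columns (features) of W by l2 norm, keep the top m and r."""
    row_scores = np.linalg.norm(W, axis=1)         # one score per sample (n values)
    col_scores = np.linalg.norm(W, axis=0)         # one score per feature (d values)
    sample_idx = np.argsort(-row_scores)[:m]       # descending order
    feature_idx = np.argsort(-col_scores)[:r]
    return sample_idx, feature_idx

# Toy W with n = 4 samples and d = 3 features.
W = np.array([[0.0, 0.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 0.0, 0.1],
              [3.0, 0.0, 4.0]])
sample_idx, feature_idx = select_samples_and_features(W, m=2, r=2)
```

In this example the fourth and second samples have the largest row norms, and the third and first features have the largest column norms, so those indices are returned.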

We take the FG-NET dataset² as an example to illustrate the effectiveness of the $\ell_{2,1}$ norm constraints on $\mathbf{W}$ and $\mathbf{W}^T$ in (6). Figure 1 (a) and (b) visualize the $\ell_{2,1}$ norm of $\mathbf{W}$ and $\mathbf{W}^T$, respectively. Many rows and columns of $\mathbf{W}$ become sparse after adding the $\ell_{2,1}$ norm constraints on $\mathbf{W}$ and $\mathbf{W}^T$, which means that $\mathbf{W}$ can conduct sample selection and feature selection simultaneously.

1 When $\cos\theta_{ij} = 0$, we can regularize $T_{ij}$ as $T_{ij} = \frac{1}{|\cos\theta_{ij}| + \varepsilon}$, where $\varepsilon$ is a very small positive constant.

2 The dataset is available at http://sting.cycollege.ac.cy/alanitis/fgnetaging/index.htm

Figure 1: The visualization of the learned $\mathbf{W}$ on the FG-NET dataset. (a) Each row is the $\ell_2$ norm value of each row of $\mathbf{W}$. (b) Each column is the $\ell_2$ norm value of each column of $\mathbf{W}$. Dark blue denotes values close to zero.

Algorithm 1 L-BFGS for the subproblem about $\mathbf{W}$

Input: Starting point $\mathbf{p}_0$, an integer $m > 0$, and a symmetric and positive definite matrix $\mathbf{H}_0$.
Initialize: $k = 0$.
Repeat
1. compute $\mathbf{d}_k \leftarrow \mathbf{H}_k \nabla f_k$ using a two-loop recursion, where $\nabla f_k$ is the sub-gradient of $f(\mathbf{p})$ at $\mathbf{p}_k$;
2. compute $\mathbf{p}_{k+1} \leftarrow \mathbf{p}_k - \alpha_k \mathbf{d}_k$, where $\alpha_k$ satisfies the Wolfe conditions;
3. $\mathbf{s}_k \leftarrow \mathbf{p}_{k+1} - \mathbf{p}_k$;
4. $\mathbf{y}_k \leftarrow \nabla f_{k+1} - \nabla f_k$;
5. $\hat{m} \leftarrow \min\{k, m - 1\}$;
6. update the Hessian matrix $\mathbf{H}_k$ using the pairs $\{\mathbf{y}_j, \mathbf{s}_j\}_{j=k-\hat{m}}^{k}$;
7. $k \leftarrow k + 1$;
Until convergence criterion satisfied.
Output: $\mathbf{p}_k$.

2.4 Optimization Algorithm 

Although problem (6) is convex, it is not easy to solve with sub-gradient type methods, since it involves several differently structured non-smooth terms. In this section, we employ the alternating direction method of multipliers (ADMM) [10] to solve (6). Theoretical results, including the global convergence and the iteration complexity, are then given.

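In order to solve (6), we first introduce two variables $\mathbf{Z}$ and $\widehat{\mathbf{W}}$ to convert (6) to the following equivalent objective:

$$\min_{\mathbf{W},\,\widehat{\mathbf{W}},\,\mathbf{Z}} \|\mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{X}\|_F^2 + \alpha\|\mathbf{W}\|_{2,1} + \beta\|\mathbf{W}^T\|_{2,1} + \gamma\|\widehat{\mathbf{W}}\|_* + \eta\|\mathbf{T} \odot \mathbf{Z}\|_1 \quad \text{s.t.} \quad \mathbf{W}\mathbf{X} = \mathbf{Z},\ \mathbf{W} = \widehat{\mathbf{W}}. \tag{8}$$

The augmented Lagrangian function of (8) is

$$\begin{aligned} \mathcal{L}_{\rho_1,\rho_2}(\mathbf{W}, \mathbf{Z}, \widehat{\mathbf{W}}, \boldsymbol{\Lambda}_1, \boldsymbol{\Lambda}_2) := {} & \|\mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{X}\|_F^2 + \alpha\|\mathbf{W}\|_{2,1} + \beta\|\mathbf{W}^T\|_{2,1} + \gamma\|\widehat{\mathbf{W}}\|_* + \eta\|\mathbf{T} \odot \mathbf{Z}\|_1 \\ & + \langle \boldsymbol{\Lambda}_1, \mathbf{W}\mathbf{X} - \mathbf{Z} \rangle + \langle \boldsymbol{\Lambda}_2, \mathbf{W} - \widehat{\mathbf{W}} \rangle + \frac{\rho_1}{2}\|\mathbf{W}\mathbf{X} - \mathbf{Z}\|_F^2 + \frac{\rho_2}{2}\|\mathbf{W} - \widehat{\mathbf{W}}\|_F^2, \end{aligned} \tag{9}$$

where $\boldsymbol{\Lambda}_1$ and $\boldsymbol{\Lambda}_2$ are Lagrange multipliers, and $\rho_1$ and $\rho_2$ are the constraint violation penalty parameters. From the augmented Lagrangian function, we can see that the subproblems about $\mathbf{Z}$ and $\widehat{\mathbf{W}}$ are totally separable. As a result, we can apply the classical two-block ADMM here, treating $\mathbf{W}$ and $(\mathbf{Z}, \widehat{\mathbf{W}})$ as the two blocks of variables. Next, we describe how to solve these subproblems in detail.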

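Computing the subproblem about $\mathbf{W}$: When the other variables are fixed at the results of the previous iteration $(\mathbf{Z}^k, \widehat{\mathbf{W}}^k, \boldsymbol{\Lambda}_1^k, \boldsymbol{\Lambda}_2^k)$, the subproblem about $\mathbf{W}$ is as follows:

$$\begin{aligned} \mathbf{W}^{k+1} &= \arg\min_{\mathbf{W}} \mathcal{L}_{\rho_1,\rho_2}(\mathbf{W}, \mathbf{Z}^k, \widehat{\mathbf{W}}^k, \boldsymbol{\Lambda}_1^k, \boldsymbol{\Lambda}_2^k) \\ &= \arg\min_{\mathbf{W}} \|\mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{X}\|_F^2 + \alpha\|\mathbf{W}\|_{2,1} + \beta\|\mathbf{W}^T\|_{2,1} + \frac{\rho_1}{2}\Big\|\mathbf{W}\mathbf{X} - \mathbf{Z}^k + \frac{\boldsymbol{\Lambda}_1^k}{\rho_1}\Big\|_F^2 + \frac{\rho_2}{2}\Big\|\mathbf{W} - \widehat{\mathbf{W}}^k + \frac{\boldsymbol{\Lambda}_2^k}{\rho_2}\Big\|_F^2. \end{aligned} \tag{10}$$

Since there is no closed-form solution for $\mathbf{W}$, we adopt a gradient-based method to derive the optimal $\mathbf{W}$. Here we choose the limited-memory BFGS (L-BFGS) algorithm, due to its efficiency for large-scale optimization problems [21], which is outlined in Algorithm 1.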
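
The core of L-BFGS is the two-loop recursion that applies the implicit inverse-Hessian estimate to a (sub-)gradient using only the stored pairs $(\mathbf{s}_j, \mathbf{y}_j)$. The sketch below shows the standard textbook recursion on flat vectors; the scaled-identity initial matrix and the toy quadratic used to generate the pairs are assumptions for illustration, and the line search and the handling of the non-smooth terms of (10) are not covered.

```python
import numpy as np

def two_loop_direction(grad, s_list, y_list):
    """Return H_k @ grad, with H_k the L-BFGS inverse-Hessian estimate.

    s_list and y_list hold the stored pairs, ordered from oldest to newest.
    """
    q = np.array(grad, dtype=float)
    rhos = [1.0 / float(y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: from the newest pair to the oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * float(s @ q)
        alphas.append(a)
        q = q - a * y
    # Initial matrix H_0 = (s^T y / y^T y) I, built from the newest pair (a common choice).
    if s_list:
        s, y = s_list[-1], y_list[-1]
        q = q * (float(s @ y) / float(y @ y))
    # Second loop: from the oldest pair to the newest.
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * float(y @ q)
        q = q + (a - b) * s
    return q

# Pairs generated by a toy quadratic with Hessian A, so that y_j = A s_j.
A = np.diag([1.0, 2.0, 5.0])
s_list = [np.array([1.0, 0.5, -0.2]), np.array([-0.3, 1.0, 0.4])]
y_list = [A @ s for s in s_list]

# The estimate satisfies the secant equation H y = s for the newest pair.
d_check = two_loop_direction(y_list[-1], s_list, y_list)
```

In Algorithm 1 the returned vector plays the role of $\mathbf{d}_k$, and the next iterate is $\mathbf{p}_k - \alpha_k \mathbf{d}_k$ with a step size satisfying the Wolfe conditions.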

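Computing the subproblem about $\mathbf{Z}$: When fixing the other variables, we can update $\mathbf{Z}$ by

$$\begin{aligned} \mathbf{Z}^{k+1} &= \arg\min_{\mathbf{Z}} \mathcal{L}_{\rho_1,\rho_2}(\mathbf{W}^{k+1}, \mathbf{Z}, \widehat{\mathbf{W}}^k, \boldsymbol{\Lambda}_1^k, \boldsymbol{\Lambda}_2^k) \\ &= \arg\min_{\mathbf{Z}} \eta\|\mathbf{T} \odot \mathbf{Z}\|_1 + \frac{\rho_1}{2}\Big\|\mathbf{Z} - \mathbf{W}^{k+1}\mathbf{X} - \frac{\boldsymbol{\Lambda}_1^k}{\rho_1}\Big\|_F^2. \end{aligned} \tag{11}$$

Problem (11) can be solved by the following matrix shrinkage operation lemma [20]: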


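Lemma 2.1. For $\mu > 0$ and $\mathbf{K} \in \mathbb{R}^{s \times t}$, the solution of the problem

$$\min_{\mathbf{L} \in \mathbb{R}^{s \times t}} \mu\|\mathbf{L}\|_1 + \frac{1}{2}\|\mathbf{L} - \mathbf{K}\|_F^2$$

is given by $\mathcal{L}_\mu(\mathbf{K}) \in \mathbb{R}^{s \times t}$, which is defined componentwise by

$$(\mathcal{L}_\mu(\mathbf{K}))_{ij} := \max\{|K_{ij}| - \mu,\ 0\} \cdot \mathrm{sgn}(K_{ij}), \tag{12}$$

where $\mathrm{sgn}(t)$ is the signum function of $t \in \mathbb{R}$, i.e.,

$$\mathrm{sgn}(t) := \begin{cases} +1 & \text{if } t > 0, \\ 0 & \text{if } t = 0, \\ -1 & \text{if } t < 0. \end{cases}$$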

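Based on Lemma 2.1, we can obtain a closed-form solution of $\mathbf{Z}$ whose $(i,j)$-th entry is expressed as

$$Z_{ij}^{k+1} := \max\Big\{\Big|\Big(\mathbf{W}^{k+1}\mathbf{X} + \frac{\boldsymbol{\Lambda}_1^k}{\rho_1}\Big)_{ij}\Big| - \frac{\eta\,T_{ij}}{\rho_1},\ 0\Big\} \cdot \mathrm{sgn}\Big(\Big(\mathbf{W}^{k+1}\mathbf{X} + \frac{\boldsymbol{\Lambda}_1^k}{\rho_1}\Big)_{ij}\Big). \tag{13}$$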
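
The shrinkage operator (12) and the resulting $\mathbf{Z}$-update (13) translate into a few lines of NumPy. This is a minimal sketch; the function names are illustrative, and the threshold is allowed to be an array so that the entrywise weights $\eta T_{ij} / \rho_1$ can be passed directly.

```python
import numpy as np

def soft_threshold(K, mu):
    """Entrywise shrinkage (12); mu may be a scalar or an array shaped like K."""
    return np.maximum(np.abs(K) - mu, 0.0) * np.sign(K)

def update_Z(W_next, X, Lambda1, T, eta, rho1):
    """Z-update (13): weighted shrinkage of W^{k+1} X + Lambda1 / rho1."""
    return soft_threshold(W_next @ X + Lambda1 / rho1, eta * T / rho1)

# Scalar threshold: entries with magnitude below 1 are set to zero.
K = np.array([[3.0, -0.5], [-2.0, 0.0]])
L = soft_threshold(K, 1.0)

# Weighted threshold: a large T_ij (distant samples) zeroes the coefficient.
T = np.array([[1.0, 1.0], [4.0, 1.0]])
Z = update_Z(K, np.eye(2), np.zeros((2, 2)), T, eta=1.0, rho1=1.0)
```

With the weighted threshold, the entry $-2$ is shrunk by $4$ and therefore set to zero, whereas under the scalar threshold it is only shrunk to $-1$.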

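Computing the subproblem about $\widehat{\mathbf{W}}$: When the other variables are fixed, $\widehat{\mathbf{W}}$ can be updated as

$$\begin{aligned} \widehat{\mathbf{W}}^{k+1} &= \arg\min_{\widehat{\mathbf{W}}} \mathcal{L}_{\rho_1,\rho_2}(\mathbf{W}^{k+1}, \mathbf{Z}^{k+1}, \widehat{\mathbf{W}}, \boldsymbol{\Lambda}_1^k, \boldsymbol{\Lambda}_2^k) \\ &= \arg\min_{\widehat{\mathbf{W}}} \gamma\|\widehat{\mathbf{W}}\|_* + \frac{\rho_2}{2}\Big\|\widehat{\mathbf{W}} - \mathbf{W}^{k+1} - \frac{\boldsymbol{\Lambda}_2^k}{\rho_2}\Big\|_F^2. \end{aligned} \tag{14}$$
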
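In order to solve the subproblem (14), we first introduce the following nuclear norm based shrinkage operation lemma.

Lemma 2.2. Let $\mathcal{L}_\mu(\cdot)$ be defined as in (12), let $\mathbf{K} \in \mathbb{R}^{s \times t}$ have rank $l$, and let $\mu > 0$. The solution of the problem

$$\min_{\mathbf{L} \in \mathbb{R}^{s \times t}} \mu\|\mathbf{L}\|_* + \frac{1}{2}\|\mathbf{L} - \mathbf{K}\|_F^2$$

is given by $\mathcal{D}_\mu(\mathbf{K}) \in \mathbb{R}^{s \times t}$, which is defined by

$$\mathcal{D}_\mu(\mathbf{K}) := \mathbf{U}\,\mathrm{diag}(\mathcal{L}_\mu(\boldsymbol{\sigma}))\,\mathbf{V}^T,$$

where $\mathbf{U} \in \mathbb{R}^{s \times l}$, $\mathbf{V} \in \mathbb{R}^{t \times l}$, and $\boldsymbol{\Sigma} \in \mathbb{R}^{l \times l}$ are given by the singular value decomposition (SVD) of $\mathbf{K}$:

$$\mathbf{K} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \quad \text{and} \quad \boldsymbol{\Sigma} = \mathrm{diag}(\boldsymbol{\sigma}) = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_l).$$

Based on Lemma 2.2, we can obtain a closed-form solution of (14).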
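
Dividing the objective in (14) by $\rho_2$ puts it in the form of Lemma 2.2 with $\mu = \gamma / \rho_2$ and $\mathbf{K} = \mathbf{W}^{k+1} + \boldsymbol{\Lambda}_2^k / \rho_2$, so the update is a singular value thresholding step. The following is a minimal sketch of that step; the function names are illustrative assumptions.

```python
import numpy as np

def singular_value_threshold(K, mu):
    """D_mu(K): shrink the singular values of K by mu and rebuild the matrix."""
    U, sigma, Vt = np.linalg.svd(K, full_matrices=False)
    return (U * np.maximum(sigma - mu, 0.0)) @ Vt      # scales the columns of U

def update_W_hat(W_next, Lambda2, gamma, rho2):
    """Closed-form solution of (14) under the reading mu = gamma / rho2."""
    return singular_value_threshold(W_next + Lambda2 / rho2, gamma / rho2)

# Diagonal example: singular values 3, 1, 0.2 become 2.5, 0.5, 0.
K = np.diag([3.0, 1.0, 0.2])
D = singular_value_threshold(K, 0.5)
W_hat = update_W_hat(K, np.zeros((3, 3)), gamma=1.0, rho2=2.0)
```

Singular values below the threshold are removed entirely, which is how the nuclear norm term lowers the rank of $\widehat{\mathbf{W}}$.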

The key steps of the proposed ALFS algorithm are summarized in Algorithm 2. We can also extend our method to a kernel version by defining a new data representation that incorporates the kernel information, as in [33].


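Algorithm 2 The ALFS Algorithm

Input: The data matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$, parameters $\alpha$, $\beta$, $\gamma$, and $\eta$.
Initialize: $\mathbf{W}^0 = \widehat{\mathbf{W}}^0 = \mathbf{0}$, $\mathbf{Z}^0 = \mathbf{0}$, $\boldsymbol{\Lambda}_1^0 = \mathbf{0}$, $\boldsymbol{\Lambda}_2^0 = \mathbf{0}$, $\rho_1 = \rho_2 = 10^{-6}$, $\max_\rho = 10^{10}$, $\tau = 1.1$, $\varepsilon = 10^{-3}$, $k = 0$.
while not converged do
1. fix the other variables and update $\mathbf{W}^{k+1}$ by Algorithm 1;
2. fix the other variables and update $\mathbf{Z}^{k+1}$ by (13);
3. fix the other variables and update $\widehat{\mathbf{W}}^{k+1}$ by
   $\widehat{\mathbf{W}}^{k+1} = \arg\min_{\widehat{\mathbf{W}}} \gamma\|\widehat{\mathbf{W}}\|_* + \frac{\rho_2}{2}\big\|\widehat{\mathbf{W}} - \mathbf{W}^{k+1} - \frac{\boldsymbol{\Lambda}_2^k}{\rho_2}\big\|_F^2$;
4. update the multipliers
   $\boldsymbol{\Lambda}_1^{k+1} = \boldsymbol{\Lambda}_1^k + \rho_1(\mathbf{W}^{k+1}\mathbf{X} - \mathbf{Z}^{k+1})$,
   $\boldsymbol{\Lambda}_2^{k+1} = \boldsymbol{\Lambda}_2^k + \rho_2(\mathbf{W}^{k+1} - \widehat{\mathbf{W}}^{k+1})$;
5. update the parameters $\rho_1$ and $\rho_2$ by
   $\rho_1 \leftarrow \min(\tau\rho_1, \max_\rho)$,
   $\rho_2 \leftarrow \min(\tau\rho_2, \max_\rho)$;
6. $k \leftarrow k + 1$;
7. check the convergence conditions $\|\mathbf{W}^k\mathbf{X} - \mathbf{Z}^k\|_\infty < \varepsilon$ and $\|\mathbf{W}^k - \widehat{\mathbf{W}}^k\|_\infty < \varepsilon$ and $|f(\mathbf{W}^k) - f(\mathbf{W}^{k-1})| < \varepsilon$, where $f(\mathbf{W}^k)$ is the objective function value of (6) at the point $\mathbf{W}^k$.
end while
Output: The matrix $\mathbf{W}^k \in \mathbb{R}^{n \times d}$.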
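
Steps 4 to 7 of Algorithm 2 (multiplier update, penalty growth, and stopping test) can be sketched as below. The names `tau` and `rho_max` are this sketch's labels for the growth factor 1.1 and the cap $10^{10}$ given in the initialization line, and the toy inputs are assumptions; the $\mathbf{W}$, $\mathbf{Z}$, and $\widehat{\mathbf{W}}$ updates of steps 1 to 3 are not included.

```python
import numpy as np

def dual_and_penalty_step(W, W_hat, Z, X, Lambda1, Lambda2, rho1, rho2,
                          tau=1.1, rho_max=1e10):
    """Steps 4-5 of Algorithm 2 plus the two constraint residuals used in step 7."""
    r1 = W @ X - Z                             # violation of W X = Z
    r2 = W - W_hat                             # violation of W = W_hat
    Lambda1 = Lambda1 + rho1 * r1              # multiplier updates use the current rho
    Lambda2 = Lambda2 + rho2 * r2
    rho1 = min(tau * rho1, rho_max)            # geometric growth with a cap
    rho2 = min(tau * rho2, rho_max)
    return Lambda1, Lambda2, rho1, rho2, float(np.abs(r1).max()), float(np.abs(r2).max())

def converged(res1, res2, f_curr, f_prev, eps=1e-3):
    """Step 7: both residuals and the objective change must fall below eps."""
    return bool(res1 < eps and res2 < eps and abs(f_curr - f_prev) < eps)

# One step from the initialization used in Algorithm 2, on toy 2 x 2 matrices.
X = np.eye(2)
W = np.array([[1.0, 0.0], [0.0, 2.0]])
W_hat = np.zeros((2, 2))
Z = np.zeros((2, 2))
L1, L2, rho1, rho2, res1, res2 = dual_and_penalty_step(
    W, W_hat, Z, X, np.zeros((2, 2)), np.zeros((2, 2)), rho1=1e-6, rho2=1e-6)
```

Starting from a very small penalty and growing it geometrically lets the early iterations focus on the objective, while later iterations increasingly enforce the two constraints.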


2.5 Algorithm Analysis 

From the framework of ALFS, we can see that Algorithm 2 is a direct application of the classical two-block ADMM, although the problem has more than two blocks of variables. All the subproblems in Algorithm 2 have closed-form solutions except the one for $\mathbf{W}$, which is computed by the L-BFGS algorithm. Based on the classical convergence results, we can obtain the global convergence of Algorithm 2 to the primal-dual optimal solution of problem (8) (see [3], [11]). Further, in [12], [13], He and Yuan prove two kinds of iteration complexity results for ADMM, in the ergodic and the non-ergodic case. They present the global convergence rate by theoretically calculating the optimality condition gap after $k$ iterations. In the following, we present both the global convergence and the iteration complexity results of Algorithm 2.

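Theorem 2.1. For given constant parameters $\alpha$, $\beta$, $\gamma$, $\eta$ and given constant penalty parameters $\rho_1$, $\rho_2$, denote the iteration sequence generated by Algorithm 2 as

$$\Sigma^k := \{\mathbf{W}^k, \mathbf{Z}^k, \widehat{\mathbf{W}}^k, \boldsymbol{\Lambda}_1^k, \boldsymbol{\Lambda}_2^k\}, \qquad \bar{\Sigma}^k := \frac{1}{k+1}\sum_{t=0}^{k} \Sigma^t, \qquad \Sigma_1^k := \{\mathbf{W}^k, \mathbf{Z}^k, \widehat{\mathbf{W}}^k\}, \qquad \Sigma_2^k := \{\boldsymbol{\Lambda}_1^k, \boldsymbol{\Lambda}_2^k\}.$$

Then we have the following results:

1. (Global Convergence) The sequence $\{\Sigma^k\}$ converges to a primal-dual optimal solution pair $(\mathbf{W}^\infty, \mathbf{Z}^\infty, \widehat{\mathbf{W}}^\infty, \boldsymbol{\Lambda}_1^\infty, \boldsymbol{\Lambda}_2^\infty)$, where $(\mathbf{W}^\infty, \mathbf{Z}^\infty, \widehat{\mathbf{W}}^\infty)$ is the global optimal solution of problem (8) and $\mathbf{W}^\infty$ is the global optimal solution of problem (6).

2. (Constraint Satisfaction) Both constraint violations converge to zero, i.e.,

$$\|\mathbf{W}^k\mathbf{X} - \mathbf{Z}^k\|_F \to 0, \qquad \|\mathbf{W}^k - \widehat{\mathbf{W}}^k\|_F \to 0.$$

3. (Ergodic Iteration Complexity) Let $(\mathbf{W}^*, \mathbf{Z}^*, \widehat{\mathbf{W}}^*, \boldsymbol{\Lambda}_1^*, \boldsymbol{\Lambda}_2^*)$ be an optimal solution pair. We have

$$\mathcal{L}_{\rho_1,\rho_2}(\bar{\Sigma}_1^k, \Sigma_2^*) - \mathcal{L}_{\rho_1,\rho_2}(\Sigma_1^*, \bar{\Sigma}_2^k) \le \frac{C_1}{k+1}, \tag{15}$$

where $C_1$ denotes a constant related to $\Sigma^0$ and $\Sigma^*$.

4. (Non-ergodic Iteration Complexity) The non-ergodic iteration complexity can be written as

$$\|\Sigma^k - \Sigma^{k+1}\|_{\mathbf{H}}^2 \le \frac{C_2}{k+1}, \tag{16}$$

where $C_2$ also denotes a constant related to $\Sigma^0$ and $\Sigma^*$, and $\mathbf{H}$ is a matrix related to $\mathbf{X}$ as follows:

$$\mathbf{H} = \begin{pmatrix} \rho_1\mathbf{X}^T\mathbf{X} + \rho_2\mathbf{I} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \rho_1\mathbf{I} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \rho_2\mathbf{I} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \frac{1}{\rho_1}\mathbf{I} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \frac{1}{\rho_2}\mathbf{I} \end{pmatrix}.$$
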

We do not present the detailed proof here, because this theorem follows directly from the basic theoretical results in [11], [12]. The first and second parts of the theorem show the global convergence of the presented algorithm, including sequence convergence and constraint convergence. From the first part, we can see that the sequence converges to the primal-dual optimal solution pair, while the second part shows that the violations of the two linear constraints converge to zero in the sense of the Frobenius norm.

The third and fourth parts above show the global convergence speed of ADMM, in the ergodic and the non-ergodic sense respectively. (15) gives the ergodic iteration complexity, which characterizes $\varepsilon$-optimality based on the primal-dual optimality gap as follows:

$$\mathrm{Gap}(\Sigma_1, \Sigma_2) := \mathcal{L}_{\rho_1,\rho_2}(\Sigma_1, \Sigma_2^*) - \mathcal{L}_{\rho_1,\rho_2}(\Sigma_1^*, \Sigma_2) \le \varepsilon. \tag{17}$$


Thus, after $k$ iterations we obtain an $O(1/k)$-optimal solution. (16) measures the optimality condition between adjacent iterations; although this by itself cannot indicate convergence, it can help accelerate the global convergence.

The above theorem not only shows the global convergence of Algorithm 2, but also presents two kinds of iteration complexity. The global convergence means that the generated sequence converges to the optimal solution from any initial point. Further, the iteration complexity results indicate how good the iterate is after $k$ iterations. Both iteration complexity results are $O(1/k)$, which is of the same order as many first-order algorithms.
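
To see what an $O(1/k)$ rate means in practice, one can ask how many iterations a bound of the form $C/(k+1) \le \varepsilon$ requires. The sketch below assumes this specific form of the bound and uses hypothetical values for the constant and the tolerance; neither is taken from the paper.

```python
import math

def iterations_needed(C, eps):
    """Smallest k >= 0 such that C / (k + 1) <= eps (assumed form of the bound)."""
    return max(0, math.ceil(C / eps) - 1)

# Hypothetical constant C = 5: halving the tolerance roughly doubles the iterations.
k_coarse = iterations_needed(C=5.0, eps=0.25)
k_fine = iterations_needed(C=5.0, eps=0.125)
```

Under this assumption, each halving of the target accuracy roughly doubles the required number of iterations, which is the typical behavior of sublinear first-order methods.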


3. RELATED WORK 

As described above, the work most related to our approach is the second group of active learning methods, which intend to select the most representative samples. In this section, we briefly review the approaches in this group. Among them, the most popular one is Transductive Experimental Design (TED) [31]. TED aims to find a representative sample subset from the unlabeled dataset, such that the dataset can be best approximated by linear combinations of the selected samples. Since this optimization problem is NP-hard, [31] proposed a suboptimal sequential optimization algorithm and a non-greedy optimization algorithm to solve it.

Following TED, more active learning algorithms have been developed. Cai and He [3] extended TED to choose samples by using a nearest neighbor graph to capture the intrinsic local manifold structure, where the graph Laplacian is incorporated into a manifold adaptive kernel space. Zhang et al. [32] adopted the idea of Locally Linear Embedding (LLE) [27] to find the reconstruction coefficients. They represented each sample by a linear combination of its neighbors, which preserves the local geometrical structure of the data well. Similar to [32], Hu et al. [15] incorporated local geometrical information into the active learning process. Specifically, they introduced a regularization term that gives nearer neighbors more influence on the linear reconstruction of a data point, and severely penalizes selected samples that are distant from the reconstructed sample. Nie et al. [24] proposed a novel method that relaxes the objective of TED to an efficient convex formulation, and used a robust sparse representation loss function to reduce the effect of outliers.


4. EXPERIMENT 

We evaluate the performance of ALFS on six real-world datasets: one speech signal processing dataset, LSVT Voice Rehabilitation; two preprocessed microarray datasets from [23], Lung and Glioma; one face aging estimation dataset, FG-NET (five age groups are estimated in the experiment); one video dataset, Libras Movement; and one physical-domain dataset, Musk. The LSVT Voice Rehabilitation dataset, the Libras Movement dataset, and the Musk dataset are downloaded from the UCI Machine Learning Repository³. Datasets from different areas serve as a good test bed for a comprehensive evaluation. Table 1 summarizes the details of the datasets used in the experiments.

Table 1: Summary of experimental datasets. #Feature, #Sample, #Train, #Test, and #Cat denote the number of features, the number of samples, the number of training samples, the number of testing samples, and the number of categories, respectively.

Dataset        #Feature   #Sample   #Train   #Test   #Cat
LSVT V. R.     309        126       50       76      2
Musk           168        476       50       426     2
Lung           3312       203       50       153     5
Glioma         4434       50        30       20      4
FG-NET         907        1002      100      902     5
Libras Mov.    90         360       200      160     15


Since ALFS is related to the second group of active learning algorithms, we compare it with some classic approaches in this group to demonstrate its effectiveness. The compared methods in the experiments are listed as follows:

• Random Sampling (RS): randomly selects samples from the training dataset, and is used as the baseline for active learning.

• Transductive Experimental Design (TED) [31]⁴: an active learning method that develops experimental design in a transductive setting.

• RRSS [24]⁵: an active learning method that takes advantage of robust representation and structured sparsity.

• Active Learning via Neighbor Reconstruction (ALNR) [15]: an active learning method using neighborhood reconstruction.

• R-CUR [22]: a randomized algorithmic approach for solving CUR matrix factorization. We name it R-CUR for short.

• ALFS: our proposed method for selecting representative samples and features simultaneously.

To show the benefit of simultaneous active sample selection and feature selection, we also compare ALFS against feature selection approaches combined with the above active learning approaches, i.e., first using a feature selection method to reduce the dimensionality, and then applying the above active learning methods to select samples based on the new low-dimensional representation. We use two kinds of feature selection methods in combination with the active learning algorithms in the experiments: Q-α [30] and variance (the features with the largest variance are selected as the most expressive features).

3 http://archive.ics.uci.edu/ml/datasets.html.

4 A sequential solver can be downloaded from http://www.dbs.ifi.lmu.de/~yu_k/ted/

5 The code can be downloaded from http://www.escience.cn/people/fpnie/papers.html
Figure 2: Comparison of different methods on six benchmark datasets. The curves show the learning accuracy over queries.


In the experiments, for each dataset, we randomly divide 
the dataset into two parts: the training set and the testing 
set, which is shown in Table [l] We apply the compared 
methods on the training set to select a certain number of 
samples for querying. After that, we learn the same decision 
tree classifier for all the methods based on these labeled 
samples, and evaluate the representativeness of the selected 
samples in terms of classification accuracy on the testing 
set. We repeat every test case for 10 times, and report the 
average classification performance. 

There are some parameters to be set in advance. The
parameters α, β, and η in our algorithm are searched from
{0.1, 1, 10, 100}. The parameter λ is always set to 1 (we
found that when λ = 1, the performance was consistently
good on all the datasets). For a fair comparison, the parameters
in TED, RRSS, and ALNR are also searched from
{0.1, 1, 10, 100}.
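The parameter search described above amounts to an exhaustive grid over three parameters with one further parameter held at 1. A minimal sketch follows; the dictionary keys `alpha`, `beta`, `eta`, and `fixed`, as well as the `score` callable, are names assumed for this illustration, and how a configuration is scored is not specified here.

```python
from itertools import product

GRID = (0.1, 1, 10, 100)  # candidate values for each searched parameter
FIXED_VALUE = 1           # the remaining parameter is always set to 1


def parameter_grid():
    """Yield every combination of the three searched parameters."""
    for alpha, beta, eta in product(GRID, repeat=3):
        yield {"alpha": alpha, "beta": beta, "eta": eta, "fixed": FIXED_VALUE}


def grid_search(score):
    """Return the configuration with the highest score (hypothetical scorer)."""
    return max(parameter_grid(), key=score)


configs = list(parameter_grid())  # 4 * 4 * 4 = 64 configurations

# Toy scorer used only to exercise the search; it is not a real accuracy.
best = grid_search(
    lambda p: -abs(p["alpha"] - 10) - abs(p["beta"] - 1) - abs(p["eta"] - 100)
)
```

With four candidate values per parameter, the search covers 64 configurations per method and dataset.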

4.1 Experimental Results

4.1.1 Comparison with Active Learning Algorithms 

In order to demonstrate the effectiveness of our ALFS
in selecting representative samples, we compare ALFS with
some state-of-the-art active learning algorithms. For ALFS
and R-CUR, we vary the number of selected features from
10 to 100 with an incremental step of 10 on all the datasets
except the Libras Movement dataset. Since the original feature
dimension of the Libras Movement dataset is 90, we
vary the number of selected features from 10 to 80 with an
incremental step of 10. The results are reported in Figure 2.
We can observe that our method achieves better performance
than all the other candidates. Taking the Libras
Movement dataset as an example, when the number of the 
selected samples is set to 150, ALFS obtains a classification 
result of 55.8%, attaining 10.7% relative improvement over 
the second best result, i.e., ALNR. This result shows that
feature selection helps in selecting representative samples
for active learning. In addition, we note that R-CUR is
not better than TED, RRSS, and ALNR on almost all the
datasets, and is even worse on the FG-NET dataset and the
LSVT Voice Rehabilitation dataset. The reason is that R-CUR [22]
is a general CUR model that adopts a randomized
algorithmic approach to seek matrices C and R satisfying
the CUR approximation. It does not treat the selection as an
optimization problem, so the selected samples and features
are unlikely to be the most representative, which limits the
direct application of R-CUR to active learning.
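For readers who wish to relate the relative improvements quoted in this section to absolute accuracies, the sketch below assumes the usual definition, (ours - baseline) / baseline. The text does not state the formula explicitly, so both the definition and the baseline value derived from it are assumptions rather than reported results.

```python
def relative_improvement(ours, baseline):
    """Relative improvement in percent, assuming (ours - baseline) / baseline."""
    return (ours - baseline) / baseline * 100.0


def implied_baseline(ours, rel_improvement_pct):
    """Invert the assumed definition to recover the baseline accuracy."""
    return ours / (1.0 + rel_improvement_pct / 100.0)


# Quoted figures: 55.8% accuracy with a 10.7% relative improvement.
baseline = implied_baseline(55.8, 10.7)         # derived value, not reported in the text
check = relative_improvement(55.8, baseline)    # recovers the quoted 10.7
```

Under this assumed definition, the quoted figures would correspond to a second-best accuracy of roughly 50.4%.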

4.1.2 Comparison with Feature Selection + Active 
Learning 

In order to demonstrate the necessity of simultaneous
sample and feature selection, we compare ALFS with
feature selection methods combined with the active learning
algorithms. We fix the number of selected samples,
and test the classification accuracy with different feature
dimensions. Figures 3 and 4 show the results.
We can see that our method outperforms those approaches

6 When the user inputs the desired number of samples m
and the number of features r, the final outputs m' and r' of
R-CUR [22] may differ slightly from m and r, respectively.
Figure 3: Comparison of Q + active learning algorithms on all the six datasets. Here, ‘Q’ denotes the feature
selection method Q-α. The curves show the learning accuracy over features.

Figure 4: Comparison of Var + active learning algorithms on all the six datasets. Here, ‘Var’ denotes the
feature selection method Variance. The curves show the learning accuracy over features.

Figure 5: CPU time vs. convergence tolerance ε.

treating sample selection and feature selection as two separate
steps. Still taking the Libras Movement dataset as
an example, when the number of selected features is set to
30, ALFS achieves 24.9% relative improvement over Q-α
combined with ALNR, and 24.2% relative improvement over
variance combined with ALNR. This indicates that simultaneous
sample and feature selection is promising for obtaining
good performance. In addition, we can observe that active
learning combined with feature selection indeed improves on
active learning alone in most cases.

4.2 CPU Time and Sensitivity Analysis 

We test the CPU running time with different convergence
tolerances ε on the FG-NET and Libras Movement datasets.
The experiments are conducted on machines with 3.20 GHz
Intel(R) Core(TM) CPUs and 4 GB of RAM, and ALFS is
implemented in MATLAB R2014b (64-bit edition) without
parallel operation. The result is shown in Figure 5. The
CPU time grows linearly as ε increases on the FG-NET
dataset, while it is not sensitive to ε on the Libras Movement
dataset.
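A measurement of this kind can be sketched as below. The solver is a toy fixed-point iteration that stops when successive iterates differ by less than the tolerance; it stands in for the actual algorithm, whose stopping rule is not restated here, and all names are assumptions for illustration. The original experiments use MATLAB; Python is used only for this sketch.

```python
import time


def solve(eps, max_iter=100_000):
    """Toy iteration x <- x - 0.1 * (x - 3), which converges linearly to 3.

    Stops when the change between successive iterates falls below eps.
    Returns the final iterate and the number of iterations performed.
    """
    x, iters = 0.0, 0
    while iters < max_iter:
        x_new = x - 0.1 * (x - 3.0)
        iters += 1
        if abs(x_new - x) < eps:
            return x_new, iters
        x = x_new
    return x, iters


def cpu_time(fn, *args):
    """Return (CPU seconds, result) for one call, using process CPU time."""
    start = time.process_time()
    result = fn(*args)
    return time.process_time() - start, result


# One timing per tolerance, as would be plotted in a time-vs-tolerance figure.
timings = {eps: cpu_time(solve, eps) for eps in (1e-2, 1e-4, 1e-6)}
```

`time.process_time` counts CPU time of the current process rather than wall-clock time, which matches the quantity reported on the vertical axis of such a plot.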

We also study the sensitivity of the parameters α, β, γ, and
η in our algorithm on the Libras Movement dataset. Figure
6 shows the results. With the feature dimension fixed, our
method is not sensitive to α, β, γ, and η over wide ranges.

5. CONCLUSIONS AND FUTURE WORK 

In this paper, we present a unified framework to simultaneously
conduct active learning and feature selection
(ALFS). Given an unlabeled dataset, our formulation naturally
and effectively incorporates feature and sample selection
by solving a regularized optimization problem rooted
in CUR factorization. We further relax the original NP-hard
non-convex problem into a convex one by introducing
structured sparsity-inducing norms, which allows for an
efficient iterative optimization algorithm (ADMM). The
superior performance of our method over the state-of-the-art
methods is verified by extensive experimental evaluations
on six benchmark datasets.

Several interesting directions can be followed up, which 
are not covered by our current work: 

• Leveraging labeled samples: ALFS selects samples
and features from a perspective of data reconstruction
in an unsupervised setting. If label information
is available, we can incorporate such prior information
into our framework, e.g., taking the objective function
of [25] as a regularization term. This would be helpful
if a specific prediction task is actually only relevant to
a few features, in which case our ‘blind’ feature selection
method may keep features that are unnecessary for the task
even though they are indispensable for representing the
sample set itself.

Figure 6: Sensitivity study of the parameters on the
Libras Movement dataset.

• Online learning: ALFS works in a batch mode, i.e.,
the whole unlabeled dataset is available in advance. We can
further extend our work to an online learning mode, such
that ALFS can efficiently and effectively handle the case
when new samples arrive.

• Additional regularization terms: In our work, motivated
by the local reconstruction philosophy, we add
the cross-sample regularization term as presented in
Sec. 2.3. This term alleviates the under-determination
of the factorization problem, and contributes
to the robustness of our method. Symmetrically, a
cross-feature regularization term can also be applied.